An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu
Abstract
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
- Anthology ID:
- 2025.acl-long.854
- Volume:
- Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria
- Editors:
- Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 17452–17485
- URL:
- https://aclanthology.org/2025.acl-long.854/
- DOI:
- 10.18653/v1/2025.acl-long.854
- Bibkey:
- burchell-etal-2025-expanded
- Cite (ACL):
- Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O’Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, and Jaume Zaragoza-Bernabeu. 2025. An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT). In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17452–17485, Vienna, Austria. Association for Computational Linguistics.
- Cite (Informal):
- An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT) (Burchell et al., ACL 2025)
- PDF:
- https://aclanthology.org/2025.acl-long.854.pdf
Export citation
@inproceedings{burchell-etal-2025-expanded,
    title = "An Expanded Massive Multilingual Dataset for High-Performance Language Technologies ({HPLT})",
    author = {Burchell, Laurie and
      de Gibert, Ona and
      Arefyev, Nikolay and
      Aulamo, Mikko and
      Ba{\~n}{\'o}n, Marta and
      Chen, Pinzhen and
      Fedorova, Mariia and
      Guillou, Liane and
      Haddow, Barry and
      Haji{\v{c}}, Jan and
      Helcl, Jind{\v{r}}ich and
      Henriksson, Erik and
      Klimaszewski, Mateusz and
      Komulainen, Ville and
      Kutuzov, Andrey and
      Kyt{\"o}niemi, Joona and
      Laippala, Veronika and
      M{\ae}hlum, Petter and
      Malik, Bhavitvya and
      Mehryary, Farrokh and
      Mikhailov, Vladislav and
      Moghe, Nikita and
      Myntti, Amanda and
      O{'}Brien, Dayy{\'a}n and
      Oepen, Stephan and
      Pal, Proyag and
      Piha, Jousia and
      Pyysalo, Sampo and
      Ram{\'i}rez-S{\'a}nchez, Gema and
      Samuel, David and
      Stepachev, Pavel and
      Tiedemann, J{\"o}rg and
      Vari{\v{s}}, Du{\v{s}}an and
      Vojt{\v{e}}chov{\'a}, Tereza and
      Zaragoza-Bernabeu, Jaume},
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.854/",
    doi = "10.18653/v1/2025.acl-long.854",
    pages = "17452--17485",
    ISBN = "979-8-89176-251-0",
    abstract = "Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value."
}